Auto

Ch 02 - Q9 (applied)
Description
Gas mileage, horsepower, and other information for 392 vehicles.

Source
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
The dataset was used in the 1983 American Statistical Association Exposition.

References
This dataset is part of the course material for the book An Introduction to Statistical Learning with Applications in R
(Ch 02 - Statistical Learning - Applied Exercises - Problem 9)

Short description of variables

  • mpg : miles per gallon
  • cylinders : Number of cylinders between 4 and 8
  • displacement : Engine displacement (cu. inches)
  • horsepower : Engine horsepower
  • weight : Vehicle weight (lbs.)
  • acceleration : Time to accelerate from 0 to 60 mph (sec.)
  • year : Model year (modulo 100)
  • origin : Origin of car (1. American, 2. European, 3. Japanese)
  • name : Vehicle name
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

1) Load packages

In [1]:
Some preliminary workings
In [2]:
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

2) Import Data

In [3]:
  1. 397
  2. 9
A data.frame: 6 × 9
   mpg   cylinders displacement horsepower weight acceleration year  origin name
   <dbl> <int>     <dbl>        <chr>      <int>  <dbl>        <int> <int>  <chr>
1  18    8         307          130        3504   12.0         70    1      chevrolet chevelle malibu
2  15    8         350          165        3693   11.5         70    1      buick skylark 320
3  18    8         318          150        3436   11.0         70    1      plymouth satellite
4  16    8         304          150        3433   12.0         70    1      amc rebel sst
5  17    8         302          140        3449   10.5         70    1      ford torino
6  15    8         429          198        4341   10.0         70    1      ford galaxie 500
In [4]:
FALSE
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

3) Data preparation

In [5]:
'data.frame':	397 obs. of  9 variables:
 $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
 $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
 $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
 $ horsepower  : chr  "130" "165" "150" "150" ...
 $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
 $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
 $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
 $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ name        : chr  "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...

The horsepower column holds numbers but was read in as 'chr', which is a red flag. This happens when at least one element in the column is not numeric. The column needs a closer look.

In [6]:
Warning message in which(is.na(as.numeric(auto$horsepower))):
"NAs introduced by coercion"
A data.frame: 5 × 9
    mpg   cylinders displacement horsepower weight acceleration year  origin name
    <dbl> <int>     <dbl>        <chr>      <int>  <dbl>        <int> <int>  <chr>
33  25.0  4          98          ?          2046   19.0         71    1      ford pinto
127 21.0  6         200          ?          2875   17.0         74    1      ford maverick
331 40.9  4          85          ?          1835   17.3         80    2      renault lecar deluxe
337 23.6  4         140          ?          2905   14.3         80    1      ford mustang cobra
355 34.5  4         100          ?          2320   15.8         81    2      renault 18i

Only 5 of the 397 rows have a missing horsepower value (recorded as "?"), so those rows can simply be dropped.

In [7]:
  1. 33
  2. 127
  3. 331
  4. 337
  5. 355
In [8]:
A data.frame: 5 × 9
    mpg   cylinders displacement horsepower weight acceleration year  origin name
    <dbl> <int>     <dbl>        <chr>      <int>  <dbl>        <int> <int>  <chr>
33  25.0  4          98          ?          2046   19.0         71    1      ford pinto
127 21.0  6         200          ?          2875   17.0         74    1      ford maverick
331 40.9  4          85          ?          1835   17.3         80    2      renault lecar deluxe
337 23.6  4         140          ?          2905   14.3         80    1      ford mustang cobra
355 34.5  4         100          ?          2320   15.8         81    2      renault 18i
In [9]:
0
  1. 392
  2. 9
'character'
In [10]:
'data.frame':	392 obs. of  9 variables:
 $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
 $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
 $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
 $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
 $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
 $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
 $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
 $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ name        : chr  "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
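
The cleaning steps in this section can be sketched as follows. The warning above shows that the notebook used `which(is.na(as.numeric(auto$horsepower)))` to locate the problem rows; everything else here is an assumption, since the original cells are not shown. `auto_demo` is a hypothetical four-row data frame built from rows printed in this notebook, and the logical-index filter is one of several ways to drop the rows.

```r
# Hypothetical mini data frame using rows printed in this notebook;
# "?" marks a missing horsepower value.
auto_demo <- data.frame(
  mpg        = c(18, 15, 25, 21),
  horsepower = c("130", "165", "?", "?"),
  name       = c("chevrolet chevelle malibu", "buick skylark 320",
                 "ford pinto", "ford maverick"),
  stringsAsFactors = FALSE
)

# as.numeric() turns non-numeric strings into NA (the coercion warning is
# silenced here), so is.na() flags the problem rows.
hp_num   <- suppressWarnings(as.numeric(auto_demo$horsepower))
bad_rows <- which(is.na(hp_num))
print(bad_rows)

# Keep only rows with a usable horsepower. A logical filter is used instead
# of auto_demo[-bad_rows, ] because negative indexing with an empty index
# would return zero rows.
auto_clean <- auto_demo[!is.na(hp_num), ]

# Store horsepower as a numeric column.
auto_clean$horsepower <- as.numeric(auto_clean$horsepower)
str(auto_clean)
```

Applied to the full data, the same steps would take the 397 rows down to the 392 rows reported above.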
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

(a) Which of the predictors are quantitative, and which are qualitative?

Quantitative → numerical values.
Qualitative → values in one of K different classes, or categories.

In [11]:
mpg           127
cylinders       5
displacement   81
horsepower     93
weight        346
acceleration   95
year           13
origin          3
name          301
variable      description                                             variable type
mpg           miles per gallon                                        quantitative
cylinders     Number of cylinders between 4 and 8                     qualitative or categorical
displacement  Engine displacement (cu. inches)                        quantitative
horsepower    Engine horsepower                                       quantitative
weight        Vehicle weight (lbs.)                                   quantitative
acceleration  Time to accelerate from 0 to 60 mph (sec.)              quantitative
year          Model year (modulo 100)                                 quantitative
origin        Origin of car (1. American, 2. European, 3. Japanese)   qualitative or categorical
name          Vehicle name                                            qualitative or categorical

"year" can be treated as quantitative, since it is ordered and can act as a proxy for the technology available at the time. With only 13 distinct values, it can also be treated as qualitative (categorical).
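
Counting distinct values per column, as in the output above, is one way to spot candidate categorical variables. The sketch below is an assumption about how that count could be produced, not the notebook's own cell. `demo` is a hypothetical data frame holding four columns from the six rows printed in section 2, and the factor labels come from the variable description at the top.

```r
# Hypothetical data frame: four columns from the first six rows shown above.
demo <- data.frame(
  mpg       = c(18, 15, 18, 16, 17, 15),
  cylinders = c(8L, 8L, 8L, 8L, 8L, 8L),
  weight    = c(3504L, 3693L, 3436L, 3433L, 3449L, 4341L),
  origin    = c(1L, 1L, 1L, 1L, 1L, 1L)
)

# Number of distinct values in each column; a small count on the full data
# suggests the column may be better treated as categorical.
n_unique <- sapply(demo, function(x) length(unique(x)))
print(n_unique)

# A qualitative variable stored as integer codes can be made explicit
# by converting it to a factor with descriptive labels.
demo$origin <- factor(demo$origin, levels = 1:3,
                      labels = c("American", "European", "Japanese"))
print(levels(demo$origin))
```

A low distinct count is only a hint: year has 13 distinct values but is still ordered, which is why it can be argued either way.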

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

(b) What is the range of each quantitative predictor?

In [12]:
  1. 'mpg'
  2. 'displacement'
  3. 'horsepower'
  4. 'weight'
  5. 'acceleration'
  6. 'year'
In [13]:
A data.frame: 3 × 6
       mpg    displacement  horsepower  weight  acceleration  year
       <dbl>  <dbl>         <dbl>       <dbl>   <dbl>         <dbl>
min     9.0    68            46         1613     8.0          70
max    46.6   455           230         5140    24.8          82
range  37.6   387           184         3527    16.8          12
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

(c) What is the mean and standard deviation of each quantitative predictor?

In [14]:
A data.frame: 5 × 6
       mpg    displacement  horsepower  weight   acceleration  year
       <dbl>  <dbl>         <dbl>       <dbl>    <dbl>         <dbl>
min     9.00   68.00         46.00      1613.00   8.00         70.00
max    46.60  455.00        230.00      5140.00  24.80         82.00
range  37.60  387.00        184.00      3527.00  16.80         12.00
mean   23.45  194.41        104.47      2977.58  15.54         75.98
sd      7.81  104.64         38.49       849.40   2.76          3.68
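
A summary table like the ones in (b) and (c) can be assembled by applying each statistic column-wise and stacking the results. This is a minimal sketch under assumptions: `demo` is a hypothetical data frame holding three columns from the six rows printed in section 2, so its numbers differ from the full-data values above.

```r
# Hypothetical data frame: three quantitative columns from the first six rows.
demo <- data.frame(
  mpg          = c(18, 15, 18, 16, 17, 15),
  displacement = c(307, 350, 318, 304, 302, 429),
  weight       = c(3504, 3693, 3436, 3433, 3449, 4341)
)

# Each sapply() call returns one named value per column;
# rbind() stacks them into a matrix with one row per statistic.
summary_tbl <- rbind(
  min   = sapply(demo, min),
  max   = sapply(demo, max),
  range = sapply(demo, function(x) diff(range(x))),  # max - min
  mean  = sapply(demo, mean),
  sd    = sapply(demo, sd)
)
print(round(summary_tbl, 2))
```

Note that `range()` in R returns the pair (min, max), so `diff(range(x))` is needed to get the single "range" figure reported in the tables.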
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

(d) Range, mean and standard deviation after removing observations 10-85

In [15]:
  1. 316
  2. 6
TRUE
In [16]:
A data.frame: 5 × 6
       mpg     displacement  horsepower  weight    acceleration  year
       <dbl>   <dbl>         <dbl>       <dbl>     <dbl>         <dbl>
min    11.000   68.000        46.000     1649.000   8.500        70.000
max    46.600  455.000       230.000     4997.000  24.800        82.000
range  35.600  387.000       184.000     3348.000  16.300        12.000
mean   24.404  187.241       100.722     2935.972  15.727        77.146
sd      7.867   99.678        35.709      811.300   2.694         3.106
In [17]:
mpg             7711.8
displacement   59168
horsepower     31828
weight        927767
acceleration    4969.7
year           24378
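
Removing observations 10 through 85 can be done with negative row indexing. The sketch below uses a hypothetical 100-row placeholder data frame (`demo`, with made-up `id` and `value` columns) purely to show the indexing and a row-count check; it is not the notebook's own code.

```r
# Placeholder data: 100 rows with an id column and an arbitrary numeric column.
demo <- data.frame(id = 1:100, value = seq(0.5, 50, by = 0.5))

# Negative indices drop rows; 10:85 covers 76 observations.
kept <- demo[-(10:85), ]

# Sanity checks: expected row count, and none of the dropped ids remain.
print(nrow(kept))
print(any(kept$id %in% 10:85))

# Statistics are then recomputed on the reduced data.
print(sapply(kept, mean))
```

On the cleaned Auto data the same step leaves 392 − 76 = 316 rows, which matches the dimensions printed in In [15].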
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

(e) Graphical examination of predictors

In [18]:
In [19]:
Observations:
- acceleration appears to be roughly normally distributed.
- A higher number of cylinders is associated with
 • lower : mpg, acceleration
 • higher : displacement, horsepower, weight
- The number of new 8-cylinder models has declined over the years.
- Newer models are lighter and have less horsepower (presumably because of decreased weight).
- mpg appears to have strong, non-linear, negative relationships with displacement, horsepower and weight.
- mpg of new models has improved over the years.
- displacement, horsepower and weight appear to have a strong positive correlation with each other.
- A moderate negative correlation may exist between horsepower and acceleration.
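
The correlations read off the plots can be checked numerically with a correlation matrix. Because the Auto file itself is not included here, this sketch uses R's built-in `mtcars` data as a stand-in (an assumption: it is a different, smaller car dataset whose `disp`, `hp` and `wt` columns play the roles of displacement, horsepower and weight). Its numbers are not the Auto results.

```r
# Stand-in data: mtcars ships with base R; it is NOT the Auto dataset.
vars <- c("mpg", "disp", "hp", "wt")

# Pearson correlation measures linear association.
cor_mat <- cor(mtcars[, vars])
print(round(cor_mat, 2))

# Spearman (rank) correlation is less sensitive to the curved,
# monotone relationships noted in the observations above.
cor_rank <- cor(mtcars[, vars], method = "spearman")
print(round(cor_rank, 2))
```

On the cleaned Auto data frame, the same call with the column names mpg, displacement, horsepower, weight and acceleration would put numbers on the observations listed above, and `pairs()` on those columns gives the matching scatterplot matrix.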
In [27]:
In [21]:
In [22]:
Observations:
- Cars of European (2) and Japanese (3) origin overlap on many variables, whereas American cars (1) have a larger and distinct spread.
- American cars are clearly separated from the other two origins in displacement, horsepower and weight.
'################ workings'
In [23]:
  3   4   5   6   8 
  4 199   3  83 103 
In [24]:
A data.frame: 13 × 6
year   3      4      5      6      8
<int>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
70     0       7     0       4     18
71     0      12     0       8      7
72     1      14     0       0     13
73     1      11     0       8     20
74     0      15     0       6      5
75     0      12     0      12      6
76     0      15     0      10      9
77     1      14     0       5      8
78     0      17     1      12      6
79     0      12     1       6     10
80     1      23     1       2      0
81     0      20     0       7      1
82     0      27     0       3      0
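
A year-by-cylinders count table like the one above can be built with `table()`. This is a sketch under assumptions, not the notebook's own cell: the `year` and `cylinders` vectors hold only the eleven rows printed earlier in this notebook (the six head rows and the five rows with "?"), so the counts are far smaller than the full-data table.

```r
# Year and cylinder values of the eleven rows printed earlier in this notebook.
year      <- c(70, 70, 70, 70, 70, 70, 71, 74, 80, 80, 81)
cylinders <- c(8, 8, 8, 8, 8, 8, 4, 6, 4, 4, 4)

# Two-way contingency table: one row per year, one column per cylinder count.
tab <- table(year = year, cylinders = cylinders)
print(tab)

# Convert to a wide data frame, similar in layout to the output above.
wide <- as.data.frame.matrix(tab)
print(wide)
```

Reading along a column of the full table shows the trend noted in the observations: the 8-cylinder counts shrink toward zero in the later years.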
ggplot with plotly
In [25]:
ggplot
In [26]:
'################ workings'
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

(f) Variables useful in predicting mpg

Except for acceleration, all the variables show some relationship or trend with mpg, whether positive or negative.
  • Positive : year
  • Negative : cylinders, displacement, horsepower, weight
  • Non-directional : origin
They can all be considered as predictors of mpg, after accounting for collinearity among them.
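
One common way to gauge collinearity among candidate predictors is the variance inflation factor, VIF = 1 / (1 − R²), where R² comes from regressing each predictor on the others. The notebook does not say which method it had in mind, so this is only an illustrative sketch. It uses the built-in `mtcars` data as a stand-in for Auto (an assumption), with `disp`, `hp` and `wt` standing in for displacement, horsepower and weight.

```r
# Stand-in predictors from mtcars (NOT the Auto dataset).
preds <- mtcars[, c("disp", "hp", "wt")]

# For each predictor, regress it on the remaining ones and convert
# the resulting R-squared into a variance inflation factor.
vif <- sapply(names(preds), function(v) {
  others <- setdiff(names(preds), v)
  fit    <- lm(reformulate(others, response = v), data = preds)
  1 / (1 - summary(fit)$r.squared)
})
print(round(vif, 2))
```

A VIF near 1 means a predictor shares little information with the others. Larger values indicate overlap, which is what the strong correlations among displacement, horsepower and weight would lead one to expect.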